By the end of this lab, you will:

1. Load and analyze the Lightcast dataset in a Spark DataFrame.
2. Create five easy and three medium-complexity visualizations using Plotly.
3. Explore salary distributions, employment trends, and job postings.
4. Analyze skills in relation to NAICS/SOC/ONET codes and salaries.
5. Customize colors, fonts, and styles in all visualizations (default themes result in a 2.5-point deduction).
6. Follow best practices for reporting on data communication.
Step 1: Load the Dataset
import pandas as pd
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "vscode"

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initialize Spark session
spark = SparkSession.builder.appName("LightcastData").getOrCreate()

# Load data
df = spark.read.option("header", "true") \
    .option("inferSchema", "true") \
    .option("multiLine", "true") \
    .option("escape", "\"") \
    .csv("lightcast_job_postings.csv")

# Show schema and sample data
df.printSchema()
df.show(5)
First, I cleaned up the employment type labels in the DataFrame for readability in the plot. From the box plot, we can observe that full-time employment has a significant number of outliers on the higher end of the salary distribution. In contrast, part-time (≤ 32 hours) shows a more tightly concentrated distribution with fewer extreme values. The interquartile range for part-time roles appears narrower, indicating lower variability within that category. The median salary for full-time roles is the highest among all groups, while part-time roles have the lowest median salaries.
2 Salary Distribution by Industry
Compare salary variations across industries.
Filter the dataset
Keep records where salary is greater than zero.
Aggregate Data
Group by NAICS industry codes.
Visualize results
Create a box plot where:
X-axis = NAICS2_NAME
Y-axis = SALARY_FROM
Customize colors, fonts, and styles.
Explanation: Write two sentences about what the graph reveals.
# Your code for 2nd question here
from pyspark.sql.functions import count  # needed for the aggregation below

filtered_df = df.filter((col("SALARY_FROM").isNotNull()) & (col("SALARY_FROM") != 0))
industry_df = filtered_df.groupBy("NAICS2_NAME").agg(count("*").alias("job_count"))
industry_df.show()
import glob  # needed to collect the Spark CSV part files

filtered_df = df.filter(
    (col("SALARY_FROM").isNotNull())
    & (col("SALARY_FROM") > 0)
    & (col("NAICS2_NAME").isin(
        "Health Care and Social Assistance",
        "Finance and Insurance",
        "Information",
    ))
)
salary_df = filtered_df.select("NAICS2_NAME", "SALARY_FROM")
salary_df.write.mode("overwrite").option("header", True).csv("industry_salaries")

files = glob.glob("industry_salaries/part-*.csv")
pdf = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

pio.renderers.default = "notebook_connected"
fig = px.box(
    pdf,
    x="NAICS2_NAME",
    y="SALARY_FROM",
    title="Salary Distribution by Industry",
    color="NAICS2_NAME",
    color_discrete_sequence=['rgb(67,67,67)', 'rgb(115,115,115)', 'rgb(49,130,189)'],
)
fig.update_layout(
    font_family="Aptos",
    title_font_size=17,
    xaxis=dict(title=dict(text='Industry Type'), zeroline=False),
    yaxis=dict(title=dict(text='Salary'), zeroline=False),
)
fig.show()
This box plot compares the salary distributions of the Information, Finance and Insurance, and Health Care and Social Assistance industries. Information has the highest median salary, while Health Care and Social Assistance has the lowest, though the latter also shows the widest spread. This suggests that roles in the Information sector tend to be more highly compensated than those in the other two industries.
3 Job Posting Trends Over Time
Analyze how job postings fluctuate over time.
Aggregate Data
Count job postings per posted date (POSTED).
Visualize results
Create a line chart where:
X-axis = POSTED
Y-axis = Number of Job Postings
Apply custom colors and font styles.
Explanation: Write two sentences about what the graph reveals.
# Your code for 3rd question here
job_post_df = df.filter(col("POSTED").isNotNull()) \
    .groupBy("POSTED") \
    .agg(count("*").alias("job_count"))
job_post_df.write.mode("overwrite").option("header", True).csv("job_posts")

files = glob.glob("job_posts/part-*.csv")
pdf = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
pdf["POSTED"] = pd.to_datetime(pdf["POSTED"])
pdf = pdf.sort_values("POSTED")

pio.renderers.default = "notebook_connected"
fig = px.line(
    pdf,
    x="POSTED",
    y="job_count",
    title="Job Postings Over Time",
    color_discrete_sequence=["rgb(49,130,189)"],
)
fig.update_layout(
    font_family="Aptos",
    title_font_size=17,
    xaxis_title="Date Posted",
    yaxis_title="Number of Job Postings",
)
fig.show()
This line graph shows the number of job postings from May 2024 to around October 2024. Postings fluctuate from day to day, but there is a noticeable dip from July to early September, suggesting that posting activity slowed during that period.
4 Top 10 Job Titles by Count
Identify the most frequently posted job titles.
Aggregate Data
Count the occurrences of each job title (TITLE_NAME).
Select the top 10 most frequent titles.
Visualize results
Create a bar chart where:
X-axis = TITLE_NAME
Y-axis = Job Count
Apply custom colors and font styles.
Explanation: Write two sentences about what the graph reveals.
# Your code for 4th question here
top_10_jobs = df.filter(col("TITLE_NAME").isNotNull()) \
    .groupBy("TITLE_NAME") \
    .agg(count("*").alias("job_count")) \
    .orderBy(col("job_count").desc()) \
    .limit(10)

top_10_jobs.write.mode("overwrite").option("header", True).csv("top_10_jobs")
files = glob.glob("top_10_jobs/part-*.csv")
pdf4 = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

pio.renderers.default = "notebook_connected"
fig = px.bar(
    pdf4,
    x="TITLE_NAME",
    y="job_count",
    title="Top 10 Jobs",
    color="TITLE_NAME",
)
fig.update_layout(
    font_family="Aptos",
    title_font_size=17,
    xaxis_title="Top 10 Jobs",
    yaxis_title="Number of Job Postings",
)
fig.show()
From this bar graph, we can see that Data Analyst roles are by far the most sought after in the current job market. Among the top 10 job titles, many are data-related, but Data Analyst stands out as the most in-demand position.
5 Remote vs On-Site Job Postings
Compare the proportion of remote and on-site job postings.
Aggregate Data
Count job postings by remote type (REMOTE_TYPE_NAME).
Visualize results
Create a pie chart where:
Labels = REMOTE_TYPE_NAME
Values = Job Count
Apply custom colors and font styles.
Explanation: Write two sentences about what the graph reveals.
# Your code for 5th question here
remote_jobs = df.filter(col("REMOTE_TYPE_NAME").isNotNull()) \
    .groupBy("REMOTE_TYPE_NAME") \
    .agg(count("*").alias("job_count"))
remote_jobs.show()
remote_jobs.write.mode("overwrite").option("header", True).csv("remote_jobs")
files = glob.glob("remote_jobs/part-*.csv")
pdf5 = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

pio.renderers.default = "notebook_connected"
custom_color = ["#4B0082", "#6A0DAD", "#8A2BE2", "#A678F1", "#C8A2C8", "#E6E6FA"]
fig = px.pie(
    pdf5,
    names="REMOTE_TYPE_NAME",
    values="job_count",
    title="Remote vs On-Site Jobs",
    color_discrete_sequence=custom_color,
)
fig.update_layout(
    font_family="Aptos",
    title_font_size=17,
)
fig.show()
This pie chart shows the proportions of remote and on-site jobs. Most postings in the data, 78.1%, are labeled "None", meaning the posting did not specify a work arrangement. Of the rest, 17.2% are listed as remote and 3.12% as hybrid.
6 Skill Demand Analysis by Industry (Stacked Bar Chart)
Identify which skills are most in demand in various industries.
Aggregate Data
Extract skills from job postings.
Count occurrences of skills grouped by NAICS industry codes.
Visualize results
Create a stacked bar chart where:
X-axis = Industry
Y-axis = Skill Count
Color = Skill
Apply custom colors and font styles.
Explanation: Write two sentences about what the graph reveals.
# Your code for 6th question here
from pyspark.sql.functions import split, explode, trim

skills_df = df.filter(
    col("SKILLS_NAME").isNotNull()
    & col("NAICS2_NAME").isNotNull()
    & col("NAICS2_NAME").isin(
        "Health Care and Social Assistance",
        "Finance and Insurance",
        "Information",
    )
)
skills_df = skills_df.withColumn("SKILL", explode(split(col("SKILLS_NAME"), ",")))
skills_df = skills_df.withColumn("SKILL", trim(col("SKILL")))
industry_skills_df = skills_df.groupBy("NAICS2_NAME", "SKILL") \
    .agg(count("*").alias("skill_count"))
import glob

industry_skills_df.write.mode("overwrite").option("header", True).csv("industry_skills")
files = glob.glob("industry_skills/part-*.csv")
pdf6 = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

fig = px.bar(
    pdf6,
    x="NAICS2_NAME",
    y="skill_count",
    color="SKILL",
    title="Top Skills by Industry",
    barmode="stack",
)
fig.update_layout(
    font_family="Aptos",
    title_font_size=18,
    xaxis_title="Industry",
    yaxis_title="Skill Count",
)
fig.show()
From this bar graph, we can see that the most sought-after skill in the Finance and Insurance industry is data warehousing, followed by leadership. In the other two industries, the most sought-after skill is computer science; it is unclear what level of programming or IT knowledge these companies are seeking, but basic programming skills would likely be necessary.
7 Salary Analysis by ONET Occupation Type (Bubble Chart)
Analyze how salaries differ across ONET occupation types.
Aggregate Data
Compute median salary for each occupation in the ONET taxonomy.
Visualize results
Create a bubble chart where:
X-axis = ONET_NAME
Y-axis = Median Salary
Size = Number of job postings
Apply custom colors and font styles.
Explanation: Write two sentences about what the graph reveals.
This bubble chart shows median salary by ONET occupation type. However, because our data contains only one occupation type, Business Intelligence Analyst, the chart displays a single bubble.
8 Career Pathway Trends (Sankey Diagram)
Visualize job transitions between different occupation levels.
Aggregate Data
Identify career transitions between SOC job classifications.
Visualize results
Create a Sankey diagram where:
Source = SOC_2021_2_NAME
Target = SOC_2021_3_NAME
Value = Number of transitions
Apply custom colors and font styles.
Explanation: Write two sentences about what the graph reveals.
# Your code for 8th question here
career_path = df.filter(
    col("SOC_2021_2_NAME").isNotNull() & col("SOC_2021_3_NAME").isNotNull()
)
career_path = career_path.groupBy("SOC_2021_2_NAME", "SOC_2021_3_NAME") \
    .agg(count("*").alias("job_count"))
career_path.show()

# The data contains a single SOC 2 -> SOC 3 transition, so build a small
# pandas DataFrame directly from the aggregated count.
career_list = [["Comp and Math", "Math Science", 72454]]
career_path_df = pd.DataFrame(
    career_list,
    columns=["SOC_2021_2_NAME", "SOC_2021_3_NAME", "job_count"],
)
import plotly.graph_objects as go

# Build a unique list of all node labels
all_labels = list(set(
    career_path_df["SOC_2021_2_NAME"].tolist()
    + career_path_df["SOC_2021_3_NAME"].tolist()
))
label_to_index = {label: i for i, label in enumerate(all_labels)}

# Map source and target labels to their index
career_path_df["source"] = career_path_df["SOC_2021_2_NAME"].map(label_to_index)
career_path_df["target"] = career_path_df["SOC_2021_3_NAME"].map(label_to_index)

# Sankey chart
fig = go.Figure(data=[go.Sankey(
    node=dict(
        pad=15,
        thickness=20,
        label=all_labels,
        color=["#FFB6C1", "#FFC0CB"],
    ),
    link=dict(
        source=career_path_df["source"],
        target=career_path_df["target"],
        value=career_path_df["job_count"],
        color=["#F8BBD0"],
    ),
)])
fig.update_layout(
    title_text="Career Flow SOC 2 to SOC 3",
    font=dict(size=15, color="#333", family="Aptos"),
    paper_bgcolor="#fff0f5",  # light lavender-pink background
)
fig.show()
The width of the flow represents the number of job postings moving between two career levels. However, because our data is not hierarchical and contains only a single one-to-one transition, the Sankey diagram is one large band flowing from source to target; with more hierarchical data, the visualization would be much more interesting.